Abstract

A critical challenge in synthesis techniques for iterative applications is the efficient analysis of performance in the presence of communication resource contention. To address this challenge, we introduce the concept of the period graph. The period graph is constructed from the output of a simulation of the system, with idle states included in the graph, and its maximum cycle mean is used to estimate overall system throughput. As an example of the utility of the period graph, we demonstrate its use in a joint power/performance optimization solution that uses either a nested genetic algorithm or a simulated annealing algorithm. We analyze the fidelity of this estimator, and quantify the speedup and optimization accuracy obtained compared to simulation.

1  Introduction 

In many practical multiprocessor systems, there is contention for one or more shared communication resources. One example of this is a shared bus. A processor must first gain access to the bus before it can execute an interprocessor communication (IPC) operation. One consequence of this contention is that under self-timed, iterative execution, there is no known method for deriving an analytical expression for the throughput of the system [14], and thus, simulation is required to get a clear picture of application performance. However, simulation is computationally very expensive, and it is highly undesirable to perform simulation inside the innermost optimization loop during synthesis. To avoid such a simulation, an accurate and efficient estimator for throughput is required. This paper presents an efficient estimator for the throughput of these systems. Our work is in the context of self-timed execution of iterative dataflow specifications, which is an efficient and popular design methodology in the domain of digital signal processing (DSP) [10]. An iterative dataflow specification consists of a dataflow representation of the body of a loop that is to be iterated a large or indefinite number of times (e.g., across a vast stream of speech samples). In self-timed execution, the assignment of tasks (dataflow graph nodes) to processors, and the execution ordering of tasks on each processor are determined at compile-time, and at run-time, processors synchronize with one another only based on interprocessor communication requirements, and do not necessarily synchronize at the end of each loop iteration.

In  this  paper,  we  assume  that  a  deterministic  protocol 
is  used  to  arbitrate  contention  for  communication  resources. 
We  assume  that  a  schedule  has  already  been  computed  so 
the  order  of  the  tasks  on  the  processors  is  known,  and  that 
we  are  adjusting  some  task  parameters  that  vary  the  task 
execution  times  in  order  to  perform  an  optimization  of  the 
system.  We  assume  that  reasonably  accurate  estimates  are 
available  for  the  task  execution  times,  and  for  the  variation 
of  execution  times  with  parameter  changes.  Later  in  the 
paper,  we  specifically  address  the  problem  of  finding  an 
optimum  set  of  supply  voltages  for  the  processors  in  order 
to  reduce  power  while  satisfying  a  throughput  constraint. 

2  Previous  work 

The estimates for task execution times can be obtained through several methods. The most straightforward is for the programmer to provide them while developing a library of primitive blocks, as is done in the Ptolemy system [19]. Analytical techniques also exist. Li and Malik [17] have proposed algorithms for estimating the execution time of embedded software in an efficient manner. Much work has been done on scheduling and binding methods for high level synthesis [12] [6] [7] [5]. These techniques attempt to optimize the schedule makespan, which is a suitable performance metric for non-iterative applications or fully-static implementations, but is not ideally suited to the iterative, self-timed context that we address in this paper. Supply voltage reduction has been used for some time in memories and consumer electronics [11]. Chandrakasan et al. [3] [4] have presented a method based on reduced voltage level operation combined with architectural-level parallelism, showing that the throughput can be maintained while reducing power. Tiwari et al. [16] presented a technique for estimating the power given a set of software instructions. This technique can be used in conjunction with the approaches proposed in this paper to obtain more accurate or automated estimates for the power consumption of the tasks in the period graph model.

3  Period  Graph 

If contention is resolved deterministically, and execution times are constant, then self-timed evolution may lead to an initial transient state, but the execution will eventually become periodic. This holds because the multiprocessor may be modeled as a finite-state system, and thus, aperiodic behavior (which implies the presence of infinitely many distinct states) cannot hold. In DSP systems, although execution times are not always constant, or known precisely, they typically adhere closely to their respective estimates with high frequency. Under such conditions, the periodic execution pattern obtained from the estimated execution times provides an estimate of overall system throughput based on the task-level estimates. Due to the largely deterministic nature of DSP applications, such system-level performance analysis, and optimization based on task-level estimates is common practice in the DSP design community [10].

For self-timed systems, when we apply execution time estimates to estimate overall throughput, it is necessary to simulate (using the execution time estimates) past the transient state until a periodic execution pattern (steady state) emerges. Unfortunately, the duration of the transient may be exponential in the size of the application specification [14], and this makes simulation-intensive, iterative synthesis approaches highly unattractive.

The objective in this paper is to greatly reduce the rate at which simulation must be carried out during iterative synthesis through the use of a novel period graph model. Given an assignment $v$ of task execution times, and a self-timed schedule, the associated period graph is constructed from the periodic, steady-state pattern of the resulting simulation. The maximum cycle mean (MCM) of the period graph (with certain adjustments) is then used as a computationally-efficient means of estimating the iteration period (the reciprocal of the throughput) as changes are explored within a neighborhood of $v$. In this context, the MCM is the maximum over all directed cycles of the sum of the task execution times divided by the sum of the edge delays. The MCM can be computed in low polynomial time [9].
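The MCM definition above can be illustrated with a small brute-force sketch. It enumerates simple cycles directly, which is exponential in the worst case and is not the low-polynomial algorithm of [9]; the function name `max_cycle_mean` and the dictionary/tuple graph representation are assumptions made only for this illustration.

```python
def max_cycle_mean(exec_time, edges):
    """Return the max over simple directed cycles of
    (sum of node execution times) / (sum of edge delays).

    exec_time: dict mapping node -> execution time (idle nodes included)
    edges:     iterable of (src, dst, delay) tuples
    Returns None if the graph has no directed cycle.
    """
    adj = {}
    for src, dst, delay in edges:
        adj.setdefault(src, []).append((dst, delay))
    # Rank nodes so each cycle is enumerated once, from its lowest-ranked node.
    rank = {node: i for i, node in enumerate(exec_time)}
    best = None

    def walk(start, node, time_sum, delay_sum, on_path):
        nonlocal best
        for nxt, delay in adj.get(node, []):
            if nxt == start:
                total_delay = delay_sum + delay
                if total_delay == 0:
                    raise ValueError("zero-delay cycle: graph would deadlock")
                ratio = time_sum / total_delay
                if best is None or ratio > best:
                    best = ratio
            elif rank[nxt] > rank[start] and nxt not in on_path:
                walk(start, nxt, time_sum + exec_time[nxt],
                     delay_sum + delay, on_path | {nxt})

    for start in exec_time:
        walk(start, start, exec_time[start], 0, {start})
    return best


# Hypothetical three-node graph: cycle A->B->A has ratio (2+3)/1 = 5.
times = {"A": 2, "B": 3, "C": 4}
arcs = [("A", "B", 0), ("B", "A", 1), ("C", "C", 1), ("A", "C", 0), ("C", "A", 2)]
print(max_cycle_mean(times, arcs))
```

For graphs of realistic size, a polynomial-time MCM algorithm such as those surveyed in [9] would replace the enumeration; the interface could stay the same.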

The first step in the construction of the period graph is the identification of the period from the simulator output. This can be performed by tracing backward through the simulation and searching for the latest intermediate time instant $t_a$ at which the system state $S(t_a)$ equals the state $S(t_b)$ obtained at the end of the simulation (here, $t_b$ denotes the simulation time limit). If no match is found, then the end of the first period exceeds $t_b$, and thus, the simulation needs to be extended beyond $t_b$. Otherwise, the region (Gantt chart) that spans the interval $[t_a, t_b]$ constitutes a (minimal) period of the simulated steady state.

Here, the system state $S(t)$ contains the execution state of each processor, which is either “idle” or representable by an ordered pair $(A, \tau)$, where $A$ is the task being executed at time $t$, and $\tau$ denotes the time remaining until the current invocation of $A$ is completed. The state $S(t)$ also contains the current buffer sizes of all IPC buffers, as well as any information (e.g., request queue status) that is used by the protocol for resolution of communication contention. Pseudo-code for our method of period extraction is given in [20].
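The backward search for a matching state can be sketched as follows. This is not the pseudo-code of [20]; it is a minimal illustration that assumes the simulator records one hashable state snapshot per time step, and the name `find_period` and the list-of-snapshots trace format are hypothetical.

```python
def find_period(states):
    """Locate one steady-state period in a simulation trace.

    states: list of hashable system-state snapshots, one per time step;
            states[-1] is the state at the simulation time limit.
    Returns (start_index, end_index) of the period, or None when no earlier
    state matches the final one (the simulation must then be extended).
    """
    final = states[-1]
    # Scan backward so that the latest matching instant is found first.
    for k in range(len(states) - 2, -1, -1):
        if states[k] == final:
            return k, len(states) - 1
    return None


# Hypothetical trace: a transient ("a", "b") followed by a repeating "c", "d".
trace = ["a", "b", "c", "d", "c", "d", "c"]
print(find_period(trace))
```

Each snapshot would in practice bundle the per-processor execution states, the IPC buffer sizes, and the arbitration state described above, for example as a tuple of tuples.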

Figure 1(a) illustrates an application graph (a dataflow specification of an application) along with a self-timed schedule; Figure 1(c) shows the periodic steady state that results from the schedule of Figure 1(a) and the execution time estimates shown in Figure 1(b); and Figure 1(d) shows the resulting period graph. The nodes in Figure 1(d) that contain diagonal stripes correspond to idle time ranges in the period, and solid black circles on edges represent delays, which model inter-iteration dependencies. Note that the steady state period may span multiple graph iterations (2 in this example), and in the period graph, this translates to multiple instances of each application graph task.

For clarity in this illustration, we have assumed negligible latency associated with IPC. As described below, non-negligible IPC costs can easily be accommodated in the period graph model by introducing send and receive tasks at appropriate points.

As illustrated in Figure 1, the period graph consists of all the tasks comprising the period that was detected, with the idle time ranges between tasks (including those that are caused by communication contention) also treated as nodes in the graph. The nodes are connected by edges in the order that they appear in the period. For each processor, an edge is placed from the last node in the period on that processor to the first node in the period on that processor. This edge is given a delay value of one (to model the associated transition between period iterations), while all of the other intraprocessor edges have delay values of zero. This is done for all the processors in the system. Our model utilizes send and receive nodes for IPC. For each IPC point, a send node is placed on the processor that is sending data, and a corresponding receive node is placed on the processor that will receive the data. The period graph is completed by adding an edge from each send node to its corresponding receive node.
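The construction rules above can be sketched in a few lines. The function name `build_period_graph`, the input format, and the choice of zero delay on send-to-receive edges are assumptions made for this illustration; the text does not state the delay value on IPC edges.

```python
def build_period_graph(period, ipc_pairs):
    """Build a period graph from one detected steady-state period.

    period:    dict mapping processor -> ordered list of (node_name, duration),
               with idle ranges already listed as nodes
    ipc_pairs: list of (send_node, receive_node) names
    Returns (exec_time, edges), where edges are (src, dst, delay) tuples.
    """
    exec_time = {}
    edges = []
    for proc, sequence in period.items():
        for name, duration in sequence:
            exec_time[name] = duration
        # Zero-delay edges follow the execution order on each processor.
        for (src, _), (dst, _) in zip(sequence, sequence[1:]):
            edges.append((src, dst, 0))
        # Unit-delay edge from the last node back to the first node.
        edges.append((sequence[-1][0], sequence[0][0], 1))
    # One edge per IPC point (zero delay is an assumption of this sketch).
    for send, recv in ipc_pairs:
        edges.append((send, recv, 0))
    return exec_time, edges


# Hypothetical two-processor period with one IPC point.
period = {
    "P1": [("A", 2), ("send1", 1), ("idle1", 1)],
    "P2": [("recv1", 1), ("B", 3)],
}
exec_time, edges = build_period_graph(period, [("send1", "recv1")])
print(len(exec_time), len(edges))
```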


4  Fidelity  of  the  estimator 

We calculate the fidelity of the period graph estimator as the task execution times are varied. Here, we use the example of varying the processor voltages in order to change the task execution times. When the voltage on a processor is varied, the execution time of a computational task varies according to (1), where $V_{dd}$ is the supply voltage, $V_t$ is the threshold voltage, and $k$ is a constant [3]. We use a value of 0.8 volts for the threshold voltage. The execution time $pe_i$ of each of these states in the original (non-scaled) period graph is referenced to a voltage $V_{ref}$. The change in execution time of each computation node is found by taking the derivative, which gives (2), where $V$ is the new voltage. It is not obvious, however, how one should adjust the idle times in the period graph. We separate the idle nodes into two sets: contention idles and data idles. When a node has the necessary data to execute, but is idle waiting for access to the bus, the associated idle node is classified as a contention idle. When a node is idle waiting for its predecessors’ data, the associated idle node is classified as a data idle. By experimenting with a large number of application graphs, we found that we could capture the effects of contention and obtain the best fidelity by zeroing out the data idles and leaving the contention idles constant as the computation nodes are scaled. Using these rules, the fidelity is calculated as follows:


•  Given an application graph, construct a valid schedule. We used the dynamic level scheduling algorithm given in [18]. Next, construct the period graph as discussed earlier. Generate $N$ voltage vectors (assignments of voltages to the processors in the target architecture). For each voltage vector, perform a simulation to determine the throughput, with the execution times of the tasks on each processor given by (1) according to the voltage on the processor. Also, obtain an estimate for the throughput by calculating the MCM of the voltage-scaled period graph, in which the execution times of the computation nodes are given by (1), and the execution times of the idle nodes are as explained above.


•  Calculate the fidelity according to (3), where

$$f_{ij} = \begin{cases} 1 & \text{if } \mathrm{sign}(S_i - S_j) = \mathrm{sign}(M_i - M_j) \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

$$\mathrm{sign}(x) = \begin{cases} -1 & \text{if } x < 0 \\ 0 & \text{if } x = 0 \\ 1 & \text{if } x > 0 \end{cases} ; \qquad (5)$$

the $S_i$'s denote the simulated throughput values; and the $M_i$'s are the corresponding estimates from the period graph.
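The pairwise comparison in (4) and (5) can be sketched as below. The normalization is an assumption: the sketch averages $f_{ij}$ over all $N(N-1)/2$ distinct pairs, which is a common form of the fidelity metric and agrees with the statement that a value of one is perfect. The function names are illustrative only.

```python
from itertools import combinations


def sign(x):
    """Three-valued sign as in (5): -1, 0, or 1."""
    return (x > 0) - (x < 0)


def fidelity(simulated, estimated):
    """Fraction of point pairs whose ordering the estimator preserves.

    simulated: simulated throughput values S_i
    estimated: period-graph estimates M_i, in the same order
    Assumes averaging of f_ij over all distinct pairs i < j.
    """
    pairs = list(combinations(range(len(simulated)), 2))
    agreements = sum(
        1 for i, j in pairs
        if sign(simulated[i] - simulated[j]) == sign(estimated[i] - estimated[j])
    )
    return agreements / len(pairs)


# Hypothetical values: the estimator swaps the order of the last two points.
print(fidelity([1.0, 2.0, 3.0], [1.0, 3.0, 2.0]))
```

Only the relative ordering of candidate solutions matters for this metric, which is why an estimator with some absolute error can still guide a search well.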

Figure 2 plots Fidelity for a six-processor system in which the voltage on the individual processors can vary between plus or minus five percent. The x-axis represents the sum of the absolute values of the voltage changes over all processors. Each point on the graph is a fidelity calculation for $N = 100$ voltage vectors. A value of one is a “perfect” fidelity. It can be seen that in the range shown, the fidelity is always greater than 0.65. It is also important that the estimator have a small error at each point. Figure 3 plots

$$\sum_{i=1}^{N} \left[ (S_i - M_i) / S_i \right]. \qquad (6)$$

It can be seen that the error increases as the voltage vector moves away from the reference point, and that the estimate is slightly biased. For the range shown in the graphs, where each processor voltage is changed by a maximum of fifteen percent, the error is less than four percent.


Figure 2. Plot of fidelity (equation 3) for six processor system vs. magnitude of voltage change on all processors. Plot title: Fidelity - 6 processors each changing at most 15%.

Figure 3. Plot of average error vs. voltage change on processors.


5  Using  the  Period  Graph  in  a  Joint  Power/Performance  Algorithm

An effective way to reduce power consumption of a processor core in CMOS technology is to lower the supply voltage level, which exploits the quadratic dependence of power on voltage [3]. Reducing the supply voltage also has the effect of decreasing the clock speed and increasing circuit delay. The circuit delay can be modeled by (1). The power consumption is given by

$$P = \alpha C_L V_{dd}^2 f, \qquad (7)$$

where $f$ is the clock frequency, $C_L$ is the load capacitance, and $\alpha$ is the switching activity [3]. To accommodate the possibility of putting processors in states of lower switching activity during idle periods, our model includes a parameter $\alpha_{idle}$ for the idle states, and a parameter $\alpha_{non\_idle}$ for the computational tasks, where $\alpha_{idle} < \alpha_{non\_idle}$. A more detailed power analysis could assign a different $\alpha$ for each computational task if that data were available. A different power optimization technique, which can be used in conjunction with the voltage scaling technique presented here, utilizes a nearly complete processor shutdown during the idle periods [8] [15]. In our model, this would correspond to $\alpha_{idle} = 0$. Our model for the power is the average energy consumption per graph iteration period. This corresponds in a typical DSP system to the average energy required to process one sample. Here, the energy of each node equals its power times its execution time.
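The delay and energy model can be sketched numerically as follows. Two things here are assumptions rather than statements from the text: the delay model (1) is taken to be the familiar first-order CMOS form $t = k V_{dd} / (V_{dd} - V_t)^2$ associated with [3], and all function names and numeric inputs are hypothetical.

```python
V_T = 0.8  # threshold voltage in volts, the value quoted in the text


def delay_ratio(v, v_ref, v_t=V_T):
    """Ratio t(v) / t(v_ref) under the ASSUMED model t = k*v / (v - v_t)**2.

    The constant k cancels, so only the two voltages are needed.
    """
    def delay(x):
        return x / (x - v_t) ** 2
    return delay(v) / delay(v_ref)


def node_power(alpha, c_load, v_dd, f_clk):
    """Dynamic power of one node: alpha * C_L * V_dd^2 * f, as in (7)."""
    return alpha * c_load * v_dd ** 2 * f_clk


def energy_per_period(nodes):
    """Sum of power * execution time over all nodes in the period graph.

    nodes: iterable of (alpha, c_load, v_dd, f_clk, exec_time) tuples;
           idle nodes simply carry the smaller alpha_idle.
    """
    return sum(node_power(a, c, v, f) * t for a, c, v, f, t in nodes)


# Lowering a 5 V supply to 4 V lengthens execution under the assumed model.
print(delay_ratio(4.0, 5.0))
```

A task whose reference execution time is measured at $V_{ref}$ would be rescaled by `delay_ratio(V, V_ref)` before its energy contribution is evaluated.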

In a system consisting of multiple processors, one has the ability to choose, within a certain range, the (fixed) operating voltage on each processor. This opens up an additional degree of freedom that can be exploited to minimize the system power consumption. By choosing a lower voltage for a processor that is executing tasks that are not on the critical path, the throughput can remain unchanged while the overall power consumption is reduced. In general, a combination of raising voltages on some processors while lowering others can yield the most attractive power/performance solution.

When applying voltage scaling to a multiprocessor system, the valid solution space is typically much too large to search by brute-force methods. In addition, since there is no general analytical formula for calculating the throughput of these systems in the presence of communication resource contention, each candidate solution must either be simulated or estimated using some heuristic.

6  Genetic  algorithm  formulation 

To demonstrate the general utility of the period graph based performance estimation approach, we incorporated it into two significantly different probabilistic search techniques to derive two different algorithms for systematic voltage scaling. The first algorithm presented utilizes the framework of genetic algorithms (GAs) [1]. The specific GA explored here consists of an inner GA nested within an outer GA. The inner GA performs a local search around a point from the population of the outer GA, using the MCM of the period graph in its objective function as an estimate for the throughput. A period constraint $T_{constraint}$ is given as an input to the optimization problem, where the period is the reciprocal of the throughput. The objective function calculates the power consumption associated with each solution by calculating the total energy per period, as discussed earlier. If the period associated with a solution violates the period constraint ($T_{solution} > T_{constraint}$), the power consumption is multiplied by a large penalty factor $\exp(100 (T_{solution} - T_{constraint}))$. The GA attempts to minimize this objective function.
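The penalized objective described above can be sketched as a single function. The name `objective` and its argument names are hypothetical; the penalty factor $\exp(100 (T_{solution} - T_{constraint}))$ and its application only on constraint violation follow the text.

```python
import math


def objective(energy_per_period, period, period_constraint, gain=100.0):
    """Energy per period, inflated exponentially when the period is too long.

    energy_per_period: total node energy over one period
    period:            estimated (MCM) or simulated iteration period
    period_constraint: maximum allowed iteration period
    gain:              exponent scale; 100 is the value given in the text
    """
    if period > period_constraint:
        penalty = math.exp(gain * (period - period_constraint))
        return energy_per_period * penalty
    return energy_per_period


# A 1% period overshoot on a unit constraint multiplies the cost by about e.
print(objective(10.0, 1.01, 1.0))
```

Because the penalty grows smoothly with the size of the violation, slightly infeasible candidates remain comparable with one another, which gives a probabilistic search a gradient back toward the feasible region.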

In the outer loop, a population of $N_{outer}$ voltage vectors is generated. A simulation is run and a period graph constructed for each of these outer loop voltage vectors. For each of the outer loop voltage vectors, a new inner loop population is generated such that $|V_{outer_i} - V_{inner_i}| < \epsilon$ for $i = 1, \ldots, N_{proc}$, where $N_{proc}$ is the number of processors, $V_{outer_i}$ is the voltage on processor $i$ in the outer population, $V_{inner_i}$ is the voltage on processor $i$ in the inner population, and $\epsilon$ is a user-defined threshold. The inner population size is $N_{inner}$. The inner GA then performs a local search using this population for a number of generations $Generations_{inner}$ in an attempt to find a locally optimal voltage vector. The inner GA uses the MCM of the period graph in its objective function. After an invocation of the inner GA is finished, one simulation is performed using the resulting voltage vector, and the actual throughput for this point is used to compute its fitness. The outer loop voltage vector is then replaced with this locally-optimized voltage vector for use in the next outer loop generation. The outer loop is run for a number of generations $Generations_{outer}$.
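Generating an inner population that respects the $\epsilon$ bound can be sketched as follows. The uniform sampling, the function name `make_inner_population`, and the use of an absolute (rather than relative) bound are assumptions of this sketch; the text specifies only the constraint itself.

```python
import random


def make_inner_population(v_outer, size, eps, rng=None):
    """Sample voltage vectors within eps of an outer-loop vector.

    v_outer: list of per-processor voltages from the outer population
    size:    inner population size
    eps:     user-defined threshold on |V_outer_i - V_inner_i|
    rng:     optional random.Random instance for reproducibility
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    rng = rng or random.Random()
    population = []
    for _ in range(size):
        member = []
        for v in v_outer:
            delta = rng.uniform(-eps, eps)
            while abs(delta) >= eps:  # keep the inequality strict
                delta = rng.uniform(-eps, eps)
            member.append(v + delta)
        population.append(member)
    return population


# Hypothetical three-processor outer vector with a 0.25 V search radius.
inner = make_inner_population([5.0, 4.5, 3.3], size=4, eps=0.25,
                              rng=random.Random(0))
print(len(inner), len(inner[0]))
```

A percentage bound, such as voltages within five percent of the outer value, could be obtained by calling the sampler per processor with `eps = 0.05 * v`.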

7  Simulated  annealing  algorithm 

Simulated annealing is another well-known method for searching large design spaces. Using a standard simulated annealing package [2], we have implemented an alternative version of period-graph-based voltage scaling optimization. The objective function here is the same as for the genetic algorithm. The system is first simulated with an initial voltage vector $CV_i = LSV_i$, and the period graph is built. In order to ensure that the period graph will be a good enough estimator, a resimulation threshold $T$ is maintained. The difference between the current input $CV_i$ to the objective function, and the voltage vector $LSV_i$ corresponding to the simulation used to compute the current period graph, is calculated. If

$$\frac{1}{N} \sum_{i=1}^{N} \frac{|CV_i - LSV_i|}{LSV_i} > T, \qquad (8)$$

the graph is resimulated using $CV$. The period graph is rebuilt, and $CV_i \rightarrow LSV_i$. For $T = 0$, the graph will be resimulated every time, and the period graph will offer no speed advantage. The larger the value of $T$, the less often the graph will be resimulated, and the faster the optimization algorithm will perform. However, when $T$ is too large, the fidelity of the period graph estimate will be unacceptably low and the quality of the final result will suffer. Based on our experiments with a number of graphs, the optimal value of $T$ is highly application-dependent, but a value of $T = 0.1$ (10%) generally gives good results.
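The resimulation test can be sketched as a small predicate. It assumes that the deviation measure in (8) is the mean relative absolute difference between the current voltage vector and the last simulated one; that reading, along with the function and argument names, is an assumption of this sketch.

```python
def needs_resimulation(current, last_simulated, threshold):
    """Decide whether the period graph should be rebuilt from a new simulation.

    current:        candidate voltage vector (CV) passed to the objective
    last_simulated: voltage vector (LSV) behind the current period graph
    threshold:      resimulation threshold T, e.g. 0.1 for 10%
    Assumes the deviation is the mean of |CV_i - LSV_i| / LSV_i.
    """
    n = len(current)
    deviation = sum(
        abs(cv - lsv) / lsv for cv, lsv in zip(current, last_simulated)
    ) / n
    return deviation > threshold


# Hypothetical two-processor case: mean deviation 15% exceeds T = 10%.
print(needs_resimulation([5.5, 6.0], [5.0, 5.0], 0.1))
```

When the predicate returns true, the caller would simulate with the current vector, rebuild the period graph, and store the current vector as the new reference.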

8  Results 

Figure 4 shows an example of the reduction in power resulting from the genetic optimization algorithm on the fft1 application graph. The parameters of the GA were $N_{outer} = N_{inner} = 50$, $Generations_{outer} = 10$, $Generations_{inner} = 20$. The local search voltages were constrained to be within five percent of the corresponding outer loop voltages. The period constraint was calculated by simulating the system with all six processors operating at voltage $V_{ref}$. For this example, the system power consumption was reduced by 42%, while maintaining the original throughput. To evaluate the advantage of the period graph approach over using brute-force simulation, a second nested GA was implemented. This algorithm was identical to the algorithm discussed above, except that the inner loop did not use the period graph estimate for the throughput. Instead, each voltage vector was evaluated by simulation. This algorithm consumed 26 times more CPU time, and produced similar results, as shown in Figure 5.


Figure 4. Plot of 1 - (optimized power)/(initial power) vs. genetic algorithm iteration using the period graph estimator. Plot title: Genetic algo, power reduction (fixed throughput constraint) using period graph. X-axis: Iteration Number (1000 generation/iteration) (6 minutes cputime/iteration).


Figure 6 summarizes the power reduction results for the simulated annealing algorithm applied to a fast Fourier transform (FFT) application graph, for different values of the resimulation threshold $T$. It can be seen that as $T$ is increased, the algorithm progresses more quickly. The simulated annealing algorithm begins with a ‘melting’ routine, where the temperature is increased until a phase change is detected. The initial flat part of the curves corresponds to the time spent in the melting routine. We have found that for values of $T$ above 20%, the period graph is not a good enough estimator and the algorithm does not converge.

Table 1 summarizes the power reduction for the simulated annealing algorithm for several additional applications using different values of the resimulation threshold. At the start of the optimization, all processor voltages were set at 5 volts. The throughput at this point was used as the throughput constraint. In the table, the first two rows correspond to two different FFT implementations; mus refers to a music synthesis algorithm, qmf refers to a quadrature mirror filter bank, meets is a measurement application, and the last three rows correspond to graphs that were generated using Sih’s algorithm for randomly generating application graphs [13]. The numbers in parentheses give the numbers of nodes in these applications. The optimization was performed for a fixed time of 30 minutes in each case. The optimum resimulation threshold was between 2% and 10% in all cases. For $T = 0.25$, the period graph is not a good estimator and none of the results returned during the optimization algorithm satisfied the throughput constraint. For the largest graph, the fixed simulation time was not long enough to make much improvement, but the best result occurred for $T = 0.1$, where the simulations are less frequent. Table 2 summarizes the power reduction for the genetic algorithm with and without using the period graph, with a fixed compile time of one hour.

Figure 5. Plot of 1 - (optimized power)/(initial power) vs. genetic algorithm iteration using simulation only.

9  Conclusion 

This paper has explored a period graph model that enables efficient voltage scaling optimization for self-timed implementations of iterative applications. The period graph can be used as a computationally efficient estimator for the throughput in multiprocessor systems in which communication contention renders exact analysis too time-consuming. This model is especially useful in iterative synthesis techniques, such as those based on probabilistic search. Our paper has demonstrated effective voltage scaling techniques based on incorporating the period graph into genetic algorithm and simulated annealing formulations. Other optimizations, such as exploiting memory/speed trade-offs of the individual tasks, are also possible. These may be more appropriate to the genetic algorithm and simulated annealing framework, as a larger set of independent moves is available during optimization. Other useful directions for further work include integrating the period graph model into the scheduling phase, rather than restricting its use to voltage scaling of fixed schedules, and the investigation of adaptive methods for dynamically adjusting the frequency of resimulation.

Table 1. Power reduction for fixed computation time.

Figure 6. Plot of (optimized power)/(initial power) vs. time for simulated annealing algorithm on FFT3 application.